Learning simple functions

Today, our goal is to see how well these machine learning systems fare when trying to predict some fairly simple functions. We will look at 5 functions: exponential, logarithmic, linear, inverse, and sine. Along the way, we will stop at interesting observations and experiment with them further, to really get a feel for how these systems respond.

Here's what we're gonna wiggle around:

  • Learning rate: 0.1, 0.01, 0.001
  • Network depth: 0, 1, 2, 3, and 4
  • Activation functions: sigmoid and ReLU
  • Dropout rate
  • Batch size

We will check whether the network can actually predict something it hasn't seen before, what the default behavior is when it can't, whether that default is acceptable, and why it behaves that way.

Another interesting trick we can do is try to learn a function from a narrow domain, and slowly shift/expand that domain. How will our network respond to that? What insights can we get to build networks that are robust to such distributional shift?

Throughout all of this, we will also be discussing whether it is realistic to expect agents to exhibit the behaviors we want without building in human priors.

We will also look for the threshold where these architectures start to break.

First, let's get the party started

In [1]:
import torch
import torch.optim as optim
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader

Here we define a custom dataset class, so we can easily create datasets and load them into data loaders:

In [2]:
class FunctionDataset(Dataset):
    def __init__(self, function: callable, start: float=-5, stop: float=5, samples: int=300):
        self.function = function
        self.start = start
        self.stop = stop
        self.samples = samples
    def __len__(self):
        return self.samples
    def __getitem__(self, index):
        # evenly spaced samples: x = start + index/samples * (stop - start),
        # so the dataset covers [start, stop) and never reaches stop itself
        x = index/self.samples * (self.stop - self.start) + self.start
        return x, self.function(x)
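
As a sanity check on the sampling rule, here is the indexing formula from `__getitem__` evaluated on its own (plain Python, no assumptions beyond the code above):

```python
# The sampling rule in FunctionDataset.__getitem__ is
#   x = index/samples * (stop - start) + start,  index = 0..samples-1,
# so the dataset covers [start, stop): `start` is included, `stop` is not.
start, stop, samples = -5.0, 5.0, 10
xs = [i / samples * (stop - start) + start for i in range(samples)]
print(xs[0], xs[-1])  # -5.0 4.0
```

This off-by-one-step detail matters later, when we compare training intervals like $(-5, 5)$ and $(-5, 7)$.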

Here are some function definitions that we will look at

In [537]:
expF = lambda x: torch.exp(x)
logF = lambda x: torch.log(x)
invF = lambda x: 1 / x
linF = lambda x: 2 * x + 8
sinF = lambda x: torch.sin(x)
expDl = DataLoader(FunctionDataset(lambda x: np.exp(x), samples=10000), batch_size=1280)
logDl = DataLoader(FunctionDataset(lambda x: np.log(x), samples=10000), batch_size=1280)  # note: np.log is NaN for x < 0, i.e. half of this interval
invDl = DataLoader(FunctionDataset(invF, samples=10001), batch_size=1280)  # 10001 samples keeps x = 0 (a divide-by-zero) off the grid
linDl = DataLoader(FunctionDataset(linF, samples=10000), batch_size=1280)
sinDl = DataLoader(FunctionDataset(lambda x: np.sin(x), samples=10000), batch_size=1280)

This is our network definition. It's a simple multilayer network, consisting of fully connected layers (the "fc" shorthand all over the place), with activation and dropout applied right after each of them. It doesn't get any simpler than this. We will also make plotting extremely streamlined, to keep the feedback loops fast.

In [671]:
# simple (#batch, 1) to (#batch, 1)
class NN(nn.Module):
    def __init__(self, hiddenDim=10, hiddenLayers=2, dropout_p=0, useReLU=True):
        super().__init__()
        self.fc_begin = nn.Linear(1, hiddenDim)
        self.fc1 = nn.Linear(hiddenDim, hiddenDim)
        self.fc2 = nn.Linear(hiddenDim, hiddenDim)
        self.fc3 = nn.Linear(hiddenDim, hiddenDim)
        self.fc4 = nn.Linear(hiddenDim, hiddenDim)
        self.fc_end = nn.Linear(hiddenDim, 1)
        self.activation = nn.ReLU() if useReLU else nn.Sigmoid()
        self.dropout = nn.Dropout(dropout_p)
        self.totalLosses = []
        self.hiddenLayers = hiddenLayers
    def forward(self, x):
        x = self.dropout(self.activation(self.fc_begin(x)))
        # ad-hoc way of doing this; a plain Python list of layers wouldn't register their parameters with pytorch (nn.ModuleList would)
        if self.hiddenLayers >= 1: x = self.dropout(self.activation(self.fc1(x)))
        if self.hiddenLayers >= 2: x = self.dropout(self.activation(self.fc2(x)))
        if self.hiddenLayers >= 3: x = self.dropout(self.activation(self.fc3(x)))
        if self.hiddenLayers >= 4: x = self.dropout(self.activation(self.fc4(x)))
        x = self.fc_end(x)
        return x
    
    def train(self, dl, lossFunction=nn.MSELoss(), optimizer=None, lr=0.01, epochs=500):
        # note: this shadows nn.Module.train(mode); harmless here since we never call eval()
        if optimizer is None:
            optimizer = optim.Adam(self.parameters(), lr=lr)
        for epoch in range(epochs):
            totalLoss = 0
            for x, y in dl:
                optimizer.zero_grad()
                x = x.view(-1, 1).float().cuda()
                output = self.forward(x)
                loss = lossFunction(output, y.view(-1, 1).float().cuda())
                loss.backward()
                totalLoss += loss.item()
                optimizer.step()
            totalLoss /= len(dl)  # average over the number of batches
            self.totalLosses.append(totalLoss)
    def plot(self, x, function: callable=None):
        plt.figure(num=None, figsize=(10, 6), dpi=350)
        x = x.view(-1, 1)
        plt.subplot(2, 1, 1)
        plt.plot(x.squeeze(), self(x.cuda()).detach().cpu().squeeze(), ".")
        plt.legend(["Learned"])
        if function is not None:
            plt.subplot(2, 1, 2)
            plt.plot(x, function(x), ".")
            plt.legend(["Real"])
        plt.show()
    def plotLosses(self, begin=0, end=0):
        plt.figure(num=None, figsize=(10, 3), dpi=350)
        if end == 0:
            end = len(self.totalLosses)
        plt.plot(range(len(self.totalLosses))[begin:end], self.totalLosses[begin:end])
        plt.legend(["Loss"])
        plt.show()
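
Side note on the hand-unrolled hidden layers: a plain Python list of layers wouldn't register their parameters with the module, but PyTorch's `nn.ModuleList` does. Here's a minimal sketch of the same architecture using it (same behavior for matching `hiddenDim`/`hiddenLayers`; the training and plotting helpers are omitted):

```python
import torch
import torch.nn as nn

# Same network as NN above, but with the hidden stack held in an
# nn.ModuleList, which (unlike a plain Python list) registers each
# layer's parameters with the module.
class ListNN(nn.Module):
    def __init__(self, hiddenDim=10, hiddenLayers=2, dropout_p=0, useReLU=True):
        super().__init__()
        self.fc_begin = nn.Linear(1, hiddenDim)
        self.hidden = nn.ModuleList(
            nn.Linear(hiddenDim, hiddenDim) for _ in range(hiddenLayers)
        )
        self.fc_end = nn.Linear(hiddenDim, 1)
        self.activation = nn.ReLU() if useReLU else nn.Sigmoid()
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, x):
        x = self.dropout(self.activation(self.fc_begin(x)))
        for layer in self.hidden:
            x = self.dropout(self.activation(layer(x)))
        return self.fc_end(x)
```

`ListNN(hiddenLayers=3)` builds three hidden blocks without any `if` chains in `forward`.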

Basic demonstration

Let's see what a basic training process looks like, before dissecting everything. First we create our network and move it onto the GPU:

In [704]:
net = NN().cuda()
In [705]:
net.train(expDl, epochs=300)
net.plot(torch.linspace(-5, 5, 300), expF)
net.plotLosses()

Running the above for a total of 300 epochs, the network with default settings is indeed capable of learning a function. Now that we know this works, let's try breaking it!

Untrained networks

We initialize a few networks and plot their predictions before any training:

In [492]:
for i in range(3):
    NN().cuda().plot(torch.linspace(-5, 5, 300))

Those are the small scale features. Looks pretty random to me. However, if you look closely, the output actually seems to be comprised of multiple small line segments. One intuition for how ReLU networks train is that these segments' endpoints and lengths change smoothly as training proceeds. How about large scale features?
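
The segment intuition can be checked numerically: sample a function on a fine grid, estimate slopes with finite differences, and count where the slope changes. `count_kinks` below is a hypothetical helper, not part of the notebook:

```python
import torch

# Rough numerical check of the piecewise-linear intuition: sample a function
# on a fine grid, estimate slopes with finite differences, and count the
# places where the slope changes noticeably. For a ReLU network each such
# place is (approximately) a kink where one hidden unit switches on or off.
def count_kinks(f, start=-5.0, stop=5.0, n=2001, tol=1e-2):
    x = torch.linspace(start, stop, n)
    y = f(x.view(-1, 1)).view(-1)
    slopes = (y[1:] - y[:-1]) / (x[1:] - x[:-1])
    flags = (slopes[1:] - slopes[:-1]).abs() > tol
    # merge adjacent flagged points, so a kink straddling a grid cell
    # isn't counted twice
    rising = flags & ~torch.cat([torch.tensor([False]), flags[:-1]])
    return int(rising.sum())
```

A straight line has no kinks, $|x|$ has one, and a sum of two shifted ReLUs has two; an untrained `NN()` should report a small number of them.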

In [493]:
for i in range(3):
    NN().cuda().plot(torch.linspace(-100, 100, 300))

Hmmmmm, the large scale features always seem to be a linear relationship. Should we expect this? I think so, considering we are using ReLU as our activation function. Let's try sigmoid, to see if there's any real difference.

These are the small scale features:

In [506]:
for i in range(3):
    NN(useReLU=False).cuda().plot(torch.linspace(-5, 5, 300))

Most small scale features look like the typical logistic curve you usually see. Let's see the medium scale features:

In [509]:
for i in range(3):
    NN(useReLU=False).cuda().plot(torch.linspace(-20, 20, 300))

There are definitely more interesting lines and curves here. Let's see the large scale features:

In [510]:
for i in range(3):
    NN(useReLU=False).cuda().plot(torch.linspace(-100, 100, 300))

Huh, they always seem to approach a specific value when the inputs get too large. At first glance, this shouldn't make sense: we have a fully connected layer at the end, so the values should be able to jump wildly, to any scale. Why would the output approach some specific value at infinity?

But on closer consideration, sigmoid approaches specific values, namely 0 and 1, so the sigmoid activations, multiplied by constants and summed, should also approach a specific value. This doesn't necessarily mean the network can't handle functions that operate at large values.
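
This saturation argument is easy to verify directly. A toy one-unit "network" $y = 3\,\sigma(2x+1) - 1$ (weights chosen arbitrarily here) flattens out far from the origin:

```python
import torch

# Sigmoid saturates at 0 and 1 for large |x|, so any fixed linear readout of
# saturated sigmoid units approaches a constant far from the origin.
f = lambda x: 3 * torch.sigmoid(2 * x + 1) - 1
print(f(torch.tensor([50.0, 100.0, 1000.0])))     # all three outputs ≈ 2.0
print(f(torch.tensor([-50.0, -100.0, -1000.0])))  # all three outputs ≈ -1.0
```

The same reasoning applies to every hidden unit of the sigmoid network, which is why the large scale plots plateau.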

One thing worth noting though: although sigmoid is not the most symmetrical of functions, the output here shows that sometimes all predictions are positive and sometimes all are negative, so there doesn't seem to be an inherent asymmetry. This gives us confidence to use other wild activation functions, like leaky ReLU, and trust that the initial network is probably unbiased.

Training

Let's move on to training the network itself. Let's step 15 epochs at a time and see what the network predicts:

In [680]:
net = NN().cuda()
for i in range(6):
    net.plot(torch.linspace(-5, 5, 300))
    net.train(expDl, epochs=15)
    print(f"Epoch: {i*15+15}")
net.plot(torch.linspace(-5, 5, 300), expF)
net.plotLosses()
Epoch: 15
Epoch: 30
Epoch: 45
Epoch: 60
Epoch: 75
Epoch: 90

What did we see here? With no training at all, the predictions are wild. With a little training, the network very quickly jumps to roughly the same shape as the exponential. Also note the scale it snaps to. At the beginning, the predictions are 0.4 to 0.6, quite near 0, but after training for a little while, they immediately jump to the 60 scale, which is quite drastic.

Then it takes quite a while to slowly but surely approach the exponential. The shape after a little training looks suspiciously close to a ReLU function. Could the large initial improvement velocity be caused by snapping into the shape of the activation function? If so, we would expect the snapping to happen in reverse too, with the $e^{-x}$ function. Let's try it out:

In [681]:
expNF = lambda x: torch.exp(-x)
expNDl = DataLoader(FunctionDataset(lambda x: np.exp(-x), samples=10000), batch_size=1280)
net = NN().cuda()
for i in range(6):
    net.plot(torch.linspace(-5, 5, 300))
    net.train(expNDl, epochs=15)
    print(f"Epoch: {i*15+15}")
net.plot(torch.linspace(-5, 5, 300), expNF)
net.plotLosses()
Epoch: 15
Epoch: 30
Epoch: 45
Epoch: 60
Epoch: 75
Epoch: 90

Okay wow, I was not expecting to be right. It really looks like this is what's happening. Let's go back to the earlier code and see what's going on between epochs 0 and 15. How quick was the snap?

In [682]:
net = NN().cuda()
for i in range(6):
    net.plot(torch.linspace(-5, 5, 300))
    net.train(expDl, epochs=1)
    print(f"Epoch: {i*1+1}")
net.plot(torch.linspace(-5, 5, 300), expF)
net.plotLosses()
Epoch: 1
Epoch: 2
Epoch: 3
Epoch: 4
Epoch: 5
Epoch: 6

Really fast, it turns out. Although visually it looks like a very dramatic improvement, our loss graph shows that this is not where the loss decreases fastest. What's up with that?

Also, how can we break the model and our conclusion about snaps? Well, what's a function that's sort of extreme around 0, doesn't look like a ReLU function at all, and preferably has to be approximated by two ReLU pieces with opposing orientations? The inverse function seems to be a good candidate. Let's test it:

In [683]:
net = NN().cuda()
for i in range(6):
    net.plot(torch.linspace(-5, 5, 300))
    net.train(invDl, epochs=15)
    print(f"Epoch: {i*15+15}")
net.plot(torch.linspace(-5, 5, 300), invF)
net.plotLosses()
Epoch: 15
Epoch: 30
Epoch: 45
Epoch: 60
Epoch: 75
Epoch: 90

Very nice. The network figured out the general structure quite well, even within the first 15 epochs. Let's zoom in:

In [687]:
net = NN().cuda()
for i in range(6):
    net.plot(torch.linspace(-5, 5, 300))
    net.train(invDl, epochs=5)
    print(f"Epoch: {i*5+5}")
net.plot(torch.linspace(-5, 5, 300), invF)
net.plotLosses()
Epoch: 5
Epoch: 10
Epoch: 15
Epoch: 20
Epoch: 25
Epoch: 30

The snapping behavior is still there, but it feels different from before. My current thinking is that the snapping is due to attraction to a high entropy zone near 0 (but with connections to much lower entropy zones nearby), because everything in the network is initialized near 0. The network wants to move the bending point of each ReLU to the point of maximum possible change, i.e. the high entropy zone at 0. Once there, the geometry can change drastically and can actually start to fit the function.

This also means that it can't really do anything until the network has dragged itself to this high entropy location. This is probably also why everywhere you look you're told to normalize your data: so the network doesn't have to drag itself over to the true mean.
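
To make the normalization point concrete, here's a hypothetical wrapper (not used elsewhere in this notebook) that maps a `FunctionDataset`-style x-range onto $[-1, 1]$ before the network sees it, so training starts out near the high entropy zone:

```python
from torch.utils.data import Dataset

# Hypothetical illustration of input normalization: rescale the dataset's
# x-range onto [-1, 1], so the network's kinks start out near the data
# instead of having to drag themselves over to it.
class NormalizedDataset(Dataset):
    def __init__(self, base):
        # base: any dataset with .start and .stop attributes, like FunctionDataset
        self.base = base
        self.mid = (base.start + base.stop) / 2
        self.half = (base.stop - base.start) / 2

    def __len__(self):
        return len(self.base)

    def __getitem__(self, index):
        x, y = self.base[index]
        return (x - self.mid) / self.half, y
```

Something like `DataLoader(NormalizedDataset(FunctionDataset(...)))` would then feed the network centered inputs; of course the plotting code would need the same rescaling.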

This also means that a network trained on one function will spend a lot of time just dragging itself back to the high entropy zone before approximating the new function. For a similar enough function, there might be a shorter, more direct path than going through the high entropy zone, but it should still feel some attraction to that zone. In that case, as we slowly increase the difference between the two functions' geometries, we should see a progression toward the case below.

If the two functions are very different from each other, we expect transfer learning not to work: the network will necessarily need to go through the high entropy zone to progress at all, and while moving there it will appear metastable to us (like the Higgs field sitting in a false vacuum) and will probably bore us, tricking us into thinking it can't improve much more. To test this, we can set up those exact two functions and look for this metastable behavior.

Finally, are there other hindrances that might manifest themselves as difficulties in transfer learning? What's the intuition behind them, and what do they feel like?

We will test out all of the above later on in the transfer learning section, but now let's focus on other things.

Attraction to high entropy zone

Let's try to see this behavior (the network moving toward the high entropy zone, with much lower entropy zones nearby) more clearly. Let's try to predict the exponential function, but this time on the $(-5, 7)$ interval:

In [672]:
exp7Dl = DataLoader(FunctionDataset(lambda x: np.exp(x), samples=10000, stop=7), batch_size=1280)
interval = torch.linspace(-10, 7, 300)
net = NN().cuda()
for i in range(6):
    net.plot(interval)
    net.train(exp7Dl, epochs=15)
    print(f"Epoch: {i*15+15}")
for i in range(6):
    net.plot(interval)
    net.train(exp7Dl, epochs=30)
    print(f"Epoch: {i*30+120}")
net.plot(interval, expF)
net.plotLosses()
Epoch: 15
Epoch: 30
Epoch: 45
Epoch: 60
Epoch: 75
Epoch: 90
Epoch: 120
Epoch: 150
Epoch: 180
Epoch: 210
Epoch: 240
Epoch: 270

Yep, looks like this is what it's doing.

Sigmoid snapping

Does the sigmoid activation exhibit such behaviors?

In [673]:
net = NN(useReLU=False).cuda()
for i in range(6):
    net.plot(torch.linspace(-5, 5, 300))
    net.train(expDl, epochs=30)
    print(f"Epoch: {i*30+30}")
net.plot(torch.linspace(-5, 5, 300), expF)
net.plotLosses()
Epoch: 30
Epoch: 60
Epoch: 90
Epoch: 120
Epoch: 150
Epoch: 180

Yes, it seems it does. Also notice how we need considerably more epochs to get roughly where we want to go? Sigmoid doesn't seem to be cooperating with us. How about $e^{-x}$?

In [679]:
net = NN(useReLU=False).cuda()
for i in range(6):
    net.plot(torch.linspace(-5, 5, 300))
    net.train(expNDl, epochs=30)
    print(f"Epoch: {i*30+30}")
net.plot(torch.linspace(-5, 5, 300), expNF)
net.plotLosses()
Epoch: 30
Epoch: 60
Epoch: 90
Epoch: 120
Epoch: 150
Epoch: 180

Yep, the same thing happened here. Still not cooperating with us, and still showing that snapping behavior.

Extrapolation

But can it extrapolate? Can it infer the structure of the exponential function outside the range we trained on? Let's see:

In [678]:
net = NN().cuda()
for i in range(6):
    net.plot(torch.linspace(-10, 10, 300))
    net.train(expDl, epochs=30)
    print(f"Epoch: {i*30+30}")
net.plot(torch.linspace(-10, 10, 300), expF)
net.plotLosses()
Epoch: 30
Epoch: 60
Epoch: 90
Epoch: 120
Epoch: 150
Epoch: 180

Doesn't look too good. It predicted the function perfectly fine within our dataset's $(-5, 5)$ interval. It also got the $(-\infty, -5)$ interval right, but that's just luck: ReLU prefers lines, and the $(-\infty, -5)$ tail is just as flat as the $(-5, 0)$ interval.

Another interesting phenomenon: the $(5, 10)$ interval keeps exactly the slope it had when the network's geometry was first learned. At epoch 30, the slope on the $(5, 10)$ interval is $\frac{200-100}{10-5}=20$, and at epoch 180 the slope on the same interval is still $\frac{250-150}{10-5}=20$. So we should expect all unpredicted segments to retain the characteristic geometry of the activation function.
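
Reading these slopes off the plots can be automated with a single finite difference; `trailing_slope` here is a hypothetical helper, not part of the notebook above:

```python
import torch

# Estimate a model's slope on an interval outside the training range with
# a single finite difference, instead of eyeballing it on the plot.
def trailing_slope(f, a=5.0, b=10.0):
    xa = torch.tensor([[a]])
    xb = torch.tensor([[b]])
    return ((f(xb) - f(xa)) / (b - a)).item()

# Sanity check: for a single ReLU unit with weight 20 and a kink at x = 3,
# the trailing slope past the kink is the weight itself:
print(trailing_slope(lambda x: 20 * torch.relu(x - 3)))  # 20.0
```

Calling `trailing_slope(net)` after each training chunk would track whether the extrapolated segment's slope really stays fixed.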

For sigmoid, that means we should expect everything to just plateau for $x\geq 5$. Is this true though?

In [674]:
net = NN(useReLU=False).cuda()
for i in range(6):
    net.plot(torch.linspace(-10, 10, 300))
    net.train(expDl, epochs=50)
    print(f"Epoch: {i*50+50}")
net.plot(torch.linspace(-10, 10, 300), expF)
net.plotLosses()
Epoch: 50
Epoch: 100
Epoch: 150
Epoch: 200
Epoch: 250
Epoch: 300

Yep, exactly as predicted. Now, what if we pretrain the model on values in $(-5, 5)$, then broaden the training to values in $(-5, 7)$: will it help in predicting what happens on $(-5, 10)$? I think not, but it's interesting to see what happens anyway:

Transfer learning

In [675]:
exp7Dl = DataLoader(FunctionDataset(lambda x: np.exp(x), samples=10000, stop=7), batch_size=1280)
interval = torch.linspace(-10, 7, 300)
net = NN().cuda()
print("Before training")
net.plot(interval)
net.train(expDl, epochs=180)
print("After training for 180 epochs")
for i in range(6):
    net.plot(interval)
    net.train(exp7Dl, epochs=15)
    print(f"Epoch: {i*15+180}")
net.plot(interval, expF)
net.plotLosses()
Before training
After training for 180 epochs
Epoch: 180
Epoch: 195
Epoch: 210
Epoch: 225
Epoch: 240
Epoch: 255

Seems like transfer learning actually works, and the network can really approximate the expanded function. Notice there is still attraction to the high entropy zone. And more astonishingly, the network sacrificed the learned geometry to get itself back to a plain ReLU shape, dragged itself over to the new high entropy zone, and then relearned the geometry.

Now let's see how it performs on the $(-5, 13)$ interval:

In [676]:
interval = torch.linspace(-10, 13, 300)
net = NN().cuda()
print("Before training")
net.plot(interval)
net.train(expDl, epochs=180)
print("After training for 180 epochs")
for i in range(6):
    net.plot(interval)
    net.train(exp7Dl, epochs=15)
    print(f"Epoch: {i*15+195}")
net.plot(interval, expF)
net.plotLosses()
Before training
After training for 180 epochs
Epoch: 195
Epoch: 210
Epoch: 225
Epoch: 240
Epoch: 255
Epoch: 270

It still couldn't do it. We're expecting too much of it. This extends to humans too: the human brain, shaped by evolution, thinks linearly, so we can't really "feel" how exponentials work and regularly downplay exponential trends. This shows up pretty much everywhere. And if the network could talk back to us, ReLU would say: "Hey man, I think linearly, and I'm only trying to approximate what you're giving me. How the hell am I supposed to know that it's supposed to look like an exponential? Exponentials are only relevant in your world, with its exponential progress. Why do you expect me to know that?"

Another repeating pattern: the trailing slope on $(8, 13)$ at epoch 270 is $\frac{2000-1250}{13-8}=150$, while the trailing slope on $(5, 10)$ at epoch 180 is $\frac{300-150}{10-5}=30$. So the geometry stays the same as the activation's default, but the exact parameters of that geometry can differ.

Btw, is transfer learning from $(-5, 5)$ to $(-5, 7)$ really faster than just learning $(-5, 7)$ directly? Let's give it a test:

In [677]:
net = NN().cuda()
for i in range(6):
    net.plot(torch.linspace(-10, 7, 300))
    net.train(exp7Dl, epochs=45)
    print(f"Epoch: {i*45+45}")
net.plot(torch.linspace(-10, 7, 300), expF)
net.plotLosses()
Epoch: 45
Epoch: 90
Epoch: 135
Epoch: 180
Epoch: 225
Epoch: 270

Here we still see the network dragging itself to the high entropy zone, and then adjusting the geometry. This is slower than the transfer learning run, not so much because of the already existing geometry, but because the pretrained network starts closer to the high entropy zone.

Transfer learning, but with completely different distributions

Previously, we sort of conjectured that the network should have a very bad day trying to approximate a function very different from the one it was trained on. So let's try to predict $e^{-x}$ using a network trained on $e^x$. This one is a little more interesting, so let's wrap it in a function we can call quickly.

In [611]:
def transferLearning1():
    interval = torch.linspace(-5, 5, 300)
    net = NN().cuda()
    print("Before training")
    net.plot(interval)
    net.train(expDl, epochs=180)
    print("After training for 180 epochs")
    for i in range(6):
        net.plot(interval)
        net.train(expNDl, epochs=30)
        print(f"Epoch: {i*30+180+30}")
    net.plot(interval, expNF)
    net.plotLosses()

1) This is expected behavior: very quick snapping toward the opposite direction, dragging itself over, and reestablishing the geometry. But notice how the segments are a bit too straight. It seems as if the flexible segments of the network have moved outside of the interval, where they are paralyzed and can't help smooth out the curve on the left.

This is moderately easy to reproduce.

In [688]:
transferLearning1()
Before training
After training for 180 epochs
Epoch: 210
Epoch: 240
Epoch: 270
Epoch: 300
Epoch: 330
Epoch: 360

2) This is also expected behavior, but it looks more dynamic and the network is not as paralyzed. This actually happens quite a lot.

In [691]:
transferLearning1()
Before training
After training for 180 epochs
Epoch: 210
Epoch: 240
Epoch: 270
Epoch: 300
Epoch: 330
Epoch: 360

3) This one looks like a migrating wave. Notice how the $(-5, 0.5)$ interval kind of freezes and can't go anywhere? It looks like a supercooled liquid: a liquid below its freezing temperature that doesn't freeze, because there are no dislocation/nucleation sites and no seed crystal has been introduced. This is a super cool (get it? haha) phenomenon, by the way; I suggest you check it out.

Anyway, the "wave" on the right travelling to the left behaves sort of like an introduced seed crystal. This one is quite hard to reproduce. A lot of the time it ends up looking like the 1$^{st}$ behavior, and the transition happens too fast to actually observe any wave migrating.

In [702]:
transferLearning1()
Before training
After training for 180 epochs
Epoch: 210
Epoch: 240
Epoch: 270
Epoch: 300
Epoch: 330
Epoch: 360

4) This behavior is like the 3$^{rd}$ case, but it seems like the seed crystal has given up on correcting the network, so the network just sits there, unchanging and dead. This is also quite common.

In [689]:
transferLearning1()
Before training
After training for 180 epochs
Epoch: 210
Epoch: 240
Epoch: 270
Epoch: 300
Epoch: 330
Epoch: 360

5) This one looks zigzaggy. On another test run (whose output I lost), the zigzag looked too perfect, and it remained that way for a long time without any disturbances. This behavior is quite rare though; the one below is seen more often.

In [701]:
transferLearning1()
Before training
After training for 180 epochs
Epoch: 210
Epoch: 240
Epoch: 270
Epoch: 300
Epoch: 330
Epoch: 360

So after all of this, it seems like the only stable ways to reach $e^{-x}$ from $e^x$ are to move to the high entropy spot and start over (behaviors 1 and 2), to exploit some weird tunneling capability through the latent space (behaviors 3 and 5), or to not get there at all (behavior 4).

It's worth pointing out that this weird tunneling through the latent space is not well suited to learning. The network is trapped in a local pocket and has to look very, very hard to actually tunnel through.

What did we learn from this? Sometimes it's not worth it to use transfer learning, and it can actually hamper progress if you abuse it by feeding the network a totally strange distribution.

This also gives us a potential technique for detecting whether we're in a strange pocket of the latent space. That would allow us to create master agents that spawn other agents which are vulnerable to distributional shift, detect when a spawned agent gets stuck, kill it, and create another from scratch. It also suggests that current systems that continuously take in new data and never restart from scratch will ultimately end up with parts of the network paralyzed.

We can also expect that a possible solution is an activation function that neither diminishes the gradient signal nor can be completely paralyzed, something like Leaky ReLU. We can also expect L2 regularization and dropout to help, because they smooth out the landscape and so allow more of that weird tunneling.
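
As a quick illustration of why Leaky ReLU might help (using torch's `nn.LeakyReLU` and its `negative_slope` parameter): a ReLU unit stuck on its negative side gets zero gradient, while a leaky one keeps a small signal, so a "dead" unit can still be dragged back:

```python
import torch
import torch.nn as nn

# ReLU passes no gradient for negative inputs; LeakyReLU keeps a small
# negative-side slope, so units that end up "dead" still receive a signal.
relu, leaky = nn.ReLU(), nn.LeakyReLU(negative_slope=0.01)

x = torch.tensor([-10.0, 10.0], requires_grad=True)
relu(x).sum().backward()
print(x.grad)   # zero gradient at x = -10, 1 at x = 10

x = torch.tensor([-10.0, 10.0], requires_grad=True)
leaky(x).sum().backward()
print(x.grad)   # 0.01 at x = -10: the dead side still learns
```

Swapping `nn.ReLU()` for `nn.LeakyReLU()` in the `NN` class above would be the actual experiment; whether it prevents the paralysis we saw is the open question.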

Afterword

There are many things we still haven't tried out yet. Here is the list of things we can do next time:

  • Do everything, but this time for sine function.
  • Analyze the loss. From preliminary tests it seems to always decrease exponentially.
  • Network exhaustion point. From the transfer-learning-with-a-different-distribution section, it seems like we can push nearly all of the network's capacity to a single spot, then sort of throw it away, leaving the network frozen. We can dive into this more, try to reproduce it reliably, see what the theoretical limit is on the information a given network can process, and compare that to human brains.
  • Try leaky ReLU and L2 regularization, to see if they help.
  • Try to predict functions with interesting spots way out in weird intervals (like $(200, 210)$). The high entropy zone is over there, so can the networks actually move there without exhausting itself? This is sort of like a human who doesn't understand cyclical dependencies.
  • Try to make agents that can automatically detect and deal with distributional shift.
  • All-ReLU networks seem to be made up of distinct segments. We could automatically detect those segments and graph them in different colors, then redo our experiments to figure out where all the network's capacity is.
  • Finally, we have not specified transfer learning behavior very concretely, so are there other hindrances that might manifest themselves as difficulties in transfer learning?

Why are we doing this? Can't we just follow guidelines on how to do convolutional networks, Markov agents, etc. and be done with it? Well, I believe that current networks are far too complicated, and that the true road to AGI will be stupidly simple. By really getting a feel for how these systems work at a fundamental level, we may actually discover new knowledge.

The same thing is happening at the LHC. For the uninitiated, the Large Hadron Collider is the biggest, most badass particle collider in the world. It smashes stuff together stupidly fast and generates stupidly high temperatures, on par with the temperature of the universe right after it was born. Besides the Higgs, the LHC hasn't made any significant discoveries for over 10 years now, so a group of scientists there is trying to measure everything about the bottom quark to astounding precision, hoping to detect something out of the ordinary. If we don't do the same and really try to break our own systems, we won't discover anything new, so what's the point?